Due to the number of plots, this page may take a minute to load.

This project analyzes the online activity of the Microsoft Developer Network (MSDN) Forums, shown below, in relation to the numbers of votes, views, and replies. MSDN hosts online communities where developers post questions, find answers, and exchange ideas about Microsoft and open-source products and technologies.

The goal is to develop a model for predicting the votes of a discussion thread.

MSDN Forums Home Page


Scraping Data

I wrote a web-scraping program to crawl the site and acquire the relevant data. The code was based on:

The site’s main data panel displays 20 entries, i.e. discussion threads, per page. Scraping 10,000 entries across 500 pages, which is apparently a hard limit set by the site, took about 50 minutes with the following Scrapy settings:
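The original settings are not reproduced here. As an illustration only, a throttled configuration for a long Scrapy crawl typically looks like the sketch below; only `LOG_LEVEL` is confirmed by the spider's log shown later, while the other keys and values are standard Scrapy settings chosen purely as an example.

```python
# Hypothetical example: the article's actual crawl settings are not shown.
# All keys are standard Scrapy settings; the values are illustrative.
CUSTOM_SETTINGS = {
    'LOG_LEVEL': 'ERROR',                 # matches the spider's logged config
    'DOWNLOAD_DELAY': 0.5,                # seconds to wait between requests
    'CONCURRENT_REQUESTS_PER_DOMAIN': 4,  # limit parallel requests per site
    'AUTOTHROTTLE_ENABLED': True,         # adapt the delay to server load
}
```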

Although the next-page link remains visible and clickable on page 500, when accessing page 501 and beyond the system consistently presents an error message and shows no further forum data. The following is a screen capture of what happened when accessing beyond page 500. Consequently, the number of pages to scrape was hard-coded as 500.

MSDN Forums Page 501 and Beyond


Eliminating Blank Lines in the Resulting csv File

The crawl process scraped about 10 MB of data, saved in a csv file. The csv file came out with a blank line after every data line, as shown below. Before the file was imported into RStudio, a Python routine removed all the blank lines.

Blank line after each data line

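A minimal sketch of such a cleanup routine follows; the article's actual Python code is not shown, so the function and file names here are illustrative. Incidentally, doubled lines like this are a common symptom of writing csv output on Windows without opening the file with `newline=''` when using Python's csv module, though the article does not state the cause here.

```python
# Illustrative sketch only (the article's actual cleanup routine is not shown):
# copy the scraped csv, dropping every blank line, before importing into RStudio.
def remove_blank_lines(src_path, dst_path):
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            if line.strip():      # keep only lines with actual content
                dst.write(line)

# e.g. remove_blank_lines('msdnforums_raw.csv', 'msdnforums.csv')
```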

After the blank lines were eliminated, RStudio imported the file successfully. The remainder of this article highlights the follow-up effort to explore the data, develop a model, and examine the results.

Importing the Data File

The resulting data frame had 10,000 observations of 17 variables.

# Spider's output
forums <- read.csv(file='msdnforums.csv')

dim(forums)
## [1] 10000    17
str(forums) 
## 'data.frame':    10000 obs. of  17 variables:
##  $ threadTitle    : Factor w/ 9895 levels "'Add-AzureRmApplicationGatewayHttpListener' & 'Add-AzureRmApplicationGatewayRequestRoutingRule' doing nothing",..: 3172 8214 1339 1364 4070 5886 7654 3838 1718 6026 ...
##  $ threadTitleLink: Factor w/ 9993 levels "https://social.msdn.microsoft.com/Forums/en-US/00017961-7db1-448b-b37a-752e5b399954/sp-form-integration-to-acce"| __truncated__,..: 880 1904 7108 9112 8905 231 4477 4045 9275 6501 ...
##  $ threadSummary  : Factor w/ 9911 levels "","' expected [D:\\home\\site\\wwwroot\\Microsoft.Bot.Sample.LuisBot.csproj]\n\nDialogs\\BasicLuisDialog.cs(110,74"| __truncated__,..: 5152 168 9904 4534 7266 3335 3511 4261 8443 1290 ...
##  $ category       : Factor w/ 37 levels ".NET Framework",..: 25 25 25 25 25 25 25 25 25 25 ...
##  $ categoryLink   : Factor w/ 37 levels "https://social.msdn.microsoft.com/Forums/en-US/home?category=apps",..: 24 24 24 24 24 24 24 24 24 24 ...
##  $ subCategory    : Factor w/ 148 levels ".NET Framework inside SQL Server",..: 114 114 114 114 114 114 114 114 114 114 ...
##  $ subCategoryLink: Factor w/ 148 levels "https://social.msdn.microsoft.com/Forums/en-US/home?forum=accessdev",..: 98 98 98 98 98 98 98 98 98 98 ...
##  $ votes          : int  0 0 0 0 0 65 0 0 0 0 ...
##  $ threadState    : Factor w/ 4 levels "Answered","Discussion",..: 4 4 1 4 4 2 4 4 4 4 ...
##  $ replyCount     : int  1 5 26 1 0 86 0 6 0 0 ...
##  $ viewCount      : int  24 102 289 32 2 2716 6 89 9 10 ...
##  $ createdByName  : Factor w/ 7343 levels "'''HuuM'''","--_--",..: 6411 6956 4459 5383 7141 3692 1256 2508 4128 4403 ...
##  $ createdByLink  : Factor w/ 9828 levels "https://social.msdn.microsoft.com:443/profile/'''huum'''?type=forum&referrer=http://social.msdn.microsoft.com/F"| __truncated__,..: 8608 9307 5930 7168 9556 4940 1676 3342 5509 5858 ...
##  $ createdByTime  : Factor w/ 9329 levels "1 hour 18 minutes ago",..: 3 38 724 35 14 3498 50 9 58 65 ...
##  $ lastReplyName  : Factor w/ 7343 levels "'''HuuM'''","--_--",..: 6411 6956 4459 5383 7141 3692 1256 2508 4128 4403 ...
##  $ lastReplyLink  : Factor w/ 9828 levels "https://social.msdn.microsoft.com:443/profile/'''huum'''?type=forum&referrer=http://social.msdn.microsoft.com/F"| __truncated__,..: 8608 9307 5930 7168 9556 4940 1676 3342 5509 5858 ...
##  $ lastReplyTime  : Factor w/ 9329 levels "1 hour 18 minutes ago",..: 3 38 724 35 14 3498 50 9 58 65 ...
sum(is.na(forums)) # Any missing values?
## [1] 0

And a grand overview was taken first.

Further Examining and Preparing the Data

The analysis focuses mainly on the three numeric fields: view count, reply count, and votes. Aside from the two timestamp fields, the non-essential fields were removed at this point.

## [1] "Data fields originally scraped"
##  [1] "threadTitle"     "threadTitleLink" "threadSummary"  
##  [4] "category"        "categoryLink"    "subCategory"    
##  [7] "subCategoryLink" "votes"           "threadState"    
## [10] "replyCount"      "viewCount"       "createdByName"  
## [13] "createdByLink"   "createdByTime"   "lastReplyName"  
## [16] "lastReplyLink"   "lastReplyTime"
## [1] "Fields interested and kept"
## [1] "category"      "subCategory"   "threadState"   "replyCount"   
## [5] "viewCount"     "votes"         "createdByTime" "lastReplyTime"

Here are the first 10 rows of the processed dataset at this point.

| category | subCategory | threadState | replyCount | viewCount | votes | createdByTime | lastReplyTime |
|---|---|---|---|---|---|---|---|
| SQL Server | SQL Server XML | Unanswered | 1 | 24 | 0 | 1 hour 36 minutes ago | 1 hour 36 minutes ago |
| SQL Server | SQL Server XML | Unanswered | 5 | 102 | 0 | 20 hours 44 minutes ago | 20 hours 44 minutes ago |
| SQL Server | SQL Server XML | Answered | 26 | 289 | 0 | Friday, October 19, 2018 10:01 PM | Friday, October 19, 2018 10:01 PM |
| SQL Server | SQL Server XML | Unanswered | 1 | 32 | 0 | 2 hours 15 minutes ago | 2 hours 15 minutes ago |
| SQL Server | SQL Server XML | Unanswered | 0 | 2 | 0 | 12 minutes ago | 12 minutes ago |
| SQL Server | SQL Server XML | Discussion | 86 | 2716 | 65 | Saturday, October 28, 2017 5:43 PM | Saturday, October 28, 2017 5:43 PM |
| SQL Server | SQL Server XML | Unanswered | 0 | 6 | 0 | 26 minutes ago | 26 minutes ago |
| SQL Server | SQL Server XML | Unanswered | 6 | 89 | 0 | 10 hours 45 minutes ago | 10 hours 45 minutes ago |
| SQL Server | SQL Server XML | Unanswered | 0 | 9 | 0 | 38 minutes ago | 38 minutes ago |
| SQL Server | SQL Server XML | Unanswered | 0 | 10 | 0 | 41 minutes ago | 41 minutes ago |

Converting Timestamps

Although this project does not include time-series analysis, the two timestamp fields, createdByTime and lastReplyTime, remain valuable for depicting the distribution of the data over time. The next step is to convert these two timestamps into the R datetime format. There were, however, some complications. For threads created or replied to on the current day, the MSDN site used a non-standard date-time format: rather than an absolute timestamp such as "Friday, October 19, 2018 10:01 PM", it recorded a relative time such as "1 hour 36 minutes ago", as the following shows:

head(forums$createdByTime)
## [1] 1 hour 36 minutes ago              20 hours 44 minutes ago           
## [3] Friday, October 19, 2018 10:01 PM  2 hours 15 minutes ago            
## [5] 12 minutes ago                     Saturday, October 28, 2017 5:43 PM
## 9329 Levels: 1 hour 18 minutes ago ... Wednesday, September 5, 2018 7:48 AM

These non-standard representations of timestamps must be converted to a standard date format for R to process.

To correct the problem, rows with these non-standard timestamps were replaced with a UTC timestamp. This UTC timestamp, shown below, was logged at the web-scraping application's start time and placed on the first line of the output log precisely to serve as a replacement for those rows with non-standard representations of time.

V1
2018-10-27 21:23:45.324523
Timestamp for post-processing use
**************************************************
{'name': 'mstechforums', 'allow_urls': ['https://social.msdn.microsoft.com/'], 'start_urls': ['https://social.msdn.microsoft.com/Forums/en-US/home'], 'custom_settings': {'LOG_LEVEL': 'ERROR'}}
Hard-coded Total Pages to 500 due to an identified constraint of the site.
Crawl as You Go… :-P
{'startThread': 1, 'threadsPerPage': 20, 'totalThreads': 2852329, 'totalPages': 500}
[TARGET URL LIST]
https://social.msdn.microsoft.com/Forums/en-us/home?sort=lastpostdesc&brandIgnore=true&page=1
https://social.msdn.microsoft.com/Forums/en-us/home?sort=lastpostdesc&brandIgnore=true&page=2
https://social.msdn.microsoft.com/Forums/en-us/home?sort=lastpostdesc&brandIgnore=true&page=3
https://social.msdn.microsoft.com/Forums/en-us/home?sort=lastpostdesc&brandIgnore=true&page=4
https://social.msdn.microsoft.com/Forums/en-us/home?sort=lastpostdesc&brandIgnore=true&page=5
https://social.msdn.microsoft.com/Forums/en-us/home?sort=lastpostdesc&brandIgnore=true&page=6
https://social.msdn.microsoft.com/Forums/en-us/home?sort=lastpostdesc&brandIgnore=true&page=7
https://social.msdn.microsoft.com/Forums/en-us/home?sort=lastpostdesc&brandIgnore=true&page=8

Before the timestamp conversion

class(forums$createdByTime)
## [1] "factor"
head(forums$createdByTime)
## [1] 1 hour 36 minutes ago              20 hours 44 minutes ago           
## [3] Friday, October 19, 2018 10:01 PM  2 hours 15 minutes ago            
## [5] 12 minutes ago                     Saturday, October 28, 2017 5:43 PM
## 9329 Levels: 1 hour 18 minutes ago ... Wednesday, September 5, 2018 7:48 AM
class(forums$lastReplyTime)
## [1] "factor"
head(forums$lastReplyTime)
## [1] 1 hour 36 minutes ago              20 hours 44 minutes ago           
## [3] Friday, October 19, 2018 10:01 PM  2 hours 15 minutes ago            
## [5] 12 minutes ago                     Saturday, October 28, 2017 5:43 PM
## 9329 Levels: 1 hour 18 minutes ago ... Wednesday, September 5, 2018 7:48 AM

This UTC timestamp in the log file was then read in and used to replace the non-standard timestamps.
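The replacement logic can be sketched in Python (the article's actual conversion is done in R): relative timestamps ending in "ago" are substituted with the crawl start time from the log, while absolute timestamps are parsed directly. `SCRAPE_START` and the function name are my own illustrative choices, with the start time taken from the spider's log above.

```python
from datetime import datetime

# Illustrative sketch (not the article's R code): substitute the crawl start
# time for the site's relative timestamps, and parse absolute ones directly.
SCRAPE_START = datetime(2018, 10, 27, 21, 23, 45)  # from the spider's log

def to_datetime(raw):
    if raw.endswith('ago'):    # non-standard form, e.g. "1 hour 36 minutes ago"
        return SCRAPE_START    # replace with the scrape start time
    # absolute form, e.g. "Friday, October 19, 2018 10:01 PM"
    return datetime.strptime(raw, '%A, %B %d, %Y %I:%M %p')
```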

After the timestamp conversion

## [[1]]
## [1] 2018-10-27
## Levels: 2018-10-27
## 
## [[2]]
## [1] 21:23:45.324523
## Levels: 21:23:45.324523
## 
## [[3]]
## [1] "Friday, October 19, 2018 10:01 PM"
## 
## [[4]]
## [1] 2018-10-27
## Levels: 2018-10-27
## 
## [[5]]
## [1] 21:23:45.324523
## Levels: 21:23:45.324523
## 
## [[6]]
## [1] "Saturday, October 28, 2017 5:43 PM"
## [[1]]
## [1] 2018-10-27
## Levels: 2018-10-27
## 
## [[2]]
## [1] 21:23:45.324523
## Levels: 21:23:45.324523
## 
## [[3]]
## [1] "Friday, October 19, 2018 10:01 PM"
## 
## [[4]]
## [1] 2018-10-27
## Levels: 2018-10-27
## 
## [[5]]
## [1] 21:23:45.324523
## Levels: 21:23:45.324523
## 
## [[6]]
## [1] "Saturday, October 28, 2017 5:43 PM"

Now the data was prepared and ready.

Exploratory Data Analysis

An overview of the prepared dataset follows.

First, review the distributions of the individual data fields over time. The following plots are interactive; click the magnifying-glass icon on a plot's top menu bar to zoom in on the details of an observation.

One thing I found interesting is that, historically, there seemed to be little evidence of churns of activity driven by product announcements, releases, product end-of-life, cyber attacks, etc.

Realizing Correlation

##    viewCount        replyCount           votes          viewCount/votes
##  Min.   :     2   Min.   :   0.000   Min.   :  0.0000   Min.   :9.188  
##  1st Qu.:    80   1st Qu.:   1.000   1st Qu.:  0.0000   1st Qu.:  Inf  
##  Median :   132   Median :   3.000   Median :  0.0000   Median :  Inf  
##  Mean   :  1542   Mean   :   4.306   Mean   :  0.7986   Mean   :  Inf  
##  3rd Qu.:   219   3rd Qu.:   5.000   3rd Qu.:  0.0000   3rd Qu.:  Inf  
##  Max.   :485555   Max.   :1815.000   Max.   :425.0000   Max.   :  Inf
## Standard Deviation  11212.02 19.19455 6.806855 NaN
##                 viewCount replyCount     votes viewCount/votes
## viewCount       1.0000000  0.2083569 0.3092113             NaN
## replyCount      0.2083569  1.0000000 0.6677484             NaN
## votes           0.3092113  0.6677484 1.0000000             NaN
## viewCount/votes       NaN        NaN       NaN               1

The correlation among the three is as expected.
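For reference, the pairwise coefficients above are Pearson correlations, computed in R with `cor()`. The computation itself reduces to covariance over the product of standard deviations; the Python version and sample data below are illustrative only.

```python
from math import sqrt

# Illustrative sketch of the Pearson correlation coefficient
# (the article computes the full matrix in R with cor()).
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```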

Take a closer view of:

viewCount and replyCount

viewCount and votes

replyCount and votes

A linear relationship seemed apparent.

View All in 3D

If the plot does not display automatically, move the cursor to the blank area and click as needed. Drag to rotate, and zoom in for details of clustered observations.

The blue ones were threads with few or no replies and similarly few or no votes, even though they might have had a few views. They all clustered near the edge where the replyCount and votes coordinate origins intersect.

With log-viewCount as the x axis and log-votes as the y axis, the distribution of these observations indicated apparent linearity.

Considering Multiple Linear Regression

The intent here is to investigate how a multiple linear regression model may fit.

Considering

\[\hat{votes} = \beta_0 + \beta_1 \cdot viewCount + \beta_2 \cdot replyCount\]
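The model above can also be sketched outside R by solving the ordinary-least-squares normal equations \((X^TX)\beta = X^Ty\) directly. The plain-Python implementation below is illustrative only; the article's actual fitting uses R's `lm()`, shown next.

```python
# Illustrative sketch: fit votes ~ b0 + b1*viewCount + b2*replyCount by
# ordinary least squares via the normal equations (the article uses lm() in R).
def ols(rows, y):
    """rows: list of predictor tuples; y: responses. Returns [b0, b1, ...]."""
    X = [(1.0,) + tuple(r) for r in rows]          # prepend intercept column
    k = len(X[0])
    # Build the normal equations (X'X) b = X'y.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    b = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    # Solve by Gaussian elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]
    return beta

# e.g. ols([(24, 1), (102, 5), (2716, 86)], [0, 0, 65]) -> [b0, b1, b2]
```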

Here’s how this model looked:

Start with the first 200 observations

fit200 = lm( v200 ~ vc200 + rc200, data = vcrcv)
summary(fit200)
## 
## Call:
## lm(formula = v200 ~ vc200 + rc200, data = vcrcv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1657  -0.7237   0.6908   0.9338  19.1749 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.4799542  0.2045854  -7.234 1.02e-11 ***
## vc200       -0.0001145  0.0001396  -0.820    0.413    
## rc200        0.5536765  0.0257211  21.526  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.523 on 197 degrees of freedom
## Multiple R-squared:  0.708,  Adjusted R-squared:  0.705 
## F-statistic: 238.8 on 2 and 197 DF,  p-value: < 2.2e-16
cat('Confidence Interval ',confint(fit200))
## Confidence Interval  -1.883413 -0.0003898601 0.5029525 -1.076496 0.000160771 0.6044005
summary(fit200)$coefficients
##                  Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept) -1.4799542051 0.2045854084 -7.2339187 1.018614e-11
## vc200       -0.0001145445 0.0001396068 -0.8204799 4.129343e-01
## rc200        0.5536764976 0.0257210729 21.5261820 1.217897e-53
#print(c('beta0',summary(fit200)$coefficients[1,1]))
#print(c('beta1',summary(fit200)$coefficients[2,1]))
qqnorm(fit200$residuals)
qqline(fit200$residuals)

plot(fit200)

#influencePlot(fit1000)

Take the first 1000 observations

fit1000 = lm( v1000 ~ vc1000 + rc1000, data = vcrcv)
summary(fit1000)
## 
## Call:
## lm(formula = v1000 ~ vc1000 + rc1000, data = vcrcv)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.868 -0.532 -0.026  0.477 43.752 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.791e-01  7.107e-02  -6.741 2.65e-11 ***
## vc1000       1.355e-05  5.727e-06   2.365   0.0182 *  
## rc1000       2.522e-01  7.995e-03  31.549  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.017 on 997 degrees of freedom
## Multiple R-squared:  0.5604, Adjusted R-squared:  0.5595 
## F-statistic: 635.5 on 2 and 997 DF,  p-value: < 2.2e-16
cat('Confidence Interval ',confint(fit1000))
## Confidence Interval  -0.6185432 2.308652e-06 0.2365284 -0.33963 2.478417e-05 0.2679044
summary(fit1000)$coefficients
##                  Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept) -4.790866e-01 7.106625e-02 -6.741409  2.650981e-11
## vc1000       1.354641e-05 5.726695e-06  2.365485  1.819705e-02
## rc1000       2.522164e-01 7.994502e-03 31.548735 4.693366e-152
#print(c('beta0',summary(fit200)$coefficients[1,1]))
#print(c('beta1',summary(fit200)$coefficients[2,1]))
qqnorm(fit1000$residuals)
qqline(fit1000$residuals)

plot(fit1000)

#influencePlot(fit1000)

Consider the first 2000 observations

fit2000 = lm( v2000 ~ vc2000 + rc2000, data = vcrcv)
summary(fit2000)
## 
## Call:
## lm(formula = v2000 ~ vc2000 + rc2000, data = vcrcv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.761  -0.513  -0.077   0.340 110.863 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.606e-01  8.204e-02  -4.395 1.16e-05 ***
## vc2000       9.996e-05  7.374e-06  13.556  < 2e-16 ***
## rc2000       2.143e-01  1.072e-02  19.996  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.162 on 1997 degrees of freedom
## Multiple R-squared:  0.3144, Adjusted R-squared:  0.3137 
## F-statistic: 457.9 on 2 and 1997 DF,  p-value: < 2.2e-16
cat('Confidence Interval ',confint(fit2000))
## Confidence Interval  -0.5214983 8.549822e-05 0.1932803 -0.1997106 0.0001144211 0.2353156
summary(fit2000)$coefficients
##                  Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) -3.606044e-01 8.204045e-02 -4.395446 1.163475e-05
## vc2000       9.995965e-05 7.373948e-06 13.555785 4.113586e-40
## rc2000       2.142980e-01 1.071699e-02 19.996103 3.138031e-81
#print(c('beta0',summary(fit2000)$coefficients[1,1]))
#print(c('beta1',summary(fit2000)$coefficients[2,1]))
qqnorm(fit2000$residuals)
qqline(fit2000$residuals)

plot(fit2000)

Take all observations

fitall = lm( v ~ vc + rc, data = vcrcv)
summary(fitall)
## 
## Call:
## lm(formula = v ~ vc + rc, data = vcrcv)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -44.88  -0.58  -0.13   0.20 392.49 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.309e-01  5.070e-02  -6.527 7.04e-11 ***
## vc           1.079e-04  4.493e-06  24.022  < 2e-16 ***
## rc           2.237e-01  2.625e-03  85.214  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.927 on 9997 degrees of freedom
## Multiple R-squared:  0.4761, Adjusted R-squared:  0.476 
## F-statistic:  4543 on 2 and 9997 DF,  p-value: < 2.2e-16
cat('Confidence Interval ',confint(fitall))
## Confidence Interval  -0.4303247 9.913496e-05 0.2185175 -0.2315458 0.000116751 0.2288075
summary(fitall)$coefficients
##                 Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept) -0.330935282 5.070370e-02 -6.526846  7.042642e-11
## vc           0.000107943 4.493429e-06 24.022411 5.076744e-124
## rc           0.223662468 2.624725e-03 85.213687  0.000000e+00
#print(c('beta0',summary(fitall)$coefficients[1,1]))
#print(c('beta1',summary(fitall)$coefficients[2,1]))
qqnorm(log(fitall$residuals))
## Warning in log(fitall$residuals): NaNs produced
qqline(log(fitall$residuals))
## Warning in log(fitall$residuals): NaNs produced

plot(fitall)

Another Diagnostic (Under investigation)
